"edtwt," or "Eating Disorder Twitter" is a group of users that discuss their experiences with a variety of eating disorders, including (but not limited to) anorexia nervosa, bulimia, binge eating disorder, and other undiagnosed conditions, or 'ednos'.
Eating disorders disproportionately affect young women, and cases have risen sharply over the past few years. Online communities that encourage eating disorders (EDs) are extremely dangerous. Social media has made it easier for these pro-ED messages to spread, reaching vulnerable populations ranging from healthy individuals who may be influenced to engage in ED behaviors to individuals who already have an ED. Recent research suggests a link between viewing online ED content and engaging in offline ED behavior, and the age group most affected by EDs overlaps heavily with the young user base of platforms like Twitter.
Some platforms have attempted to censor dangerous content like this, but doing so can result in 'over-warning' users about 'triggering' hashtags or topics, even users who are seeking help or support. Such warnings and interventions can therefore quickly lose their efficacy and even cause more harm than good. It's important that we truly understand the nature of online communities before introducing interventions, to avoid issues like this. In particular, we need to know more about the types of ED-related information people using social media are exposed to.
In this study, I collected tweets over a roughly week-long period to investigate the 'daily' conversations of edtwt. To filter the tweets I searched for via Twitter's API, I used the hashtags compiled by Dawn B. Branley and Judith Covey, which they scraped from the website www.hashtagify.me.
Twitter's official API allows developers and researchers to collect tweets, along with metadata about the tweets and their authors, in order to study the platform's data and communities. I signed up for a developer account and used this API to collect my tweets.
As mentioned earlier, I filtered Twitter's 'search' endpoint using the hashtags from Branley and Covey's 2017 paper.
searchterms = ['proana', '#proana', 'pro-ana', 'pro ana', 'anorexia', '#anorexia',
'anorexic', '#anorexic', '#promia', 'bulimia', '#bulimia', 'bulimic',
'#eatingdisorder', 'eating disorder', '#edproblems', 'ednos', '#ednos',
'thinspiration', '#thinspiration', 'thinspo', '#thinspo']
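As a side note, a term list like this can be collapsed into a single query string for Twitter's v2 search syntax. The sketch below uses a hypothetical helper, `build_query` (not part of the scraper itself), and quotes multi-word phrases so they match as phrases rather than as separate keywords:

```python
def build_query(terms):
    """Collapse a list of search terms into one OR query,
    quoting multi-word phrases so they match as phrases."""
    parts = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(parts) + ") -is:retweet lang:en"

query = build_query(["proana", "pro ana", "#thinspo"])
# query == '(proana OR "pro ana" OR #thinspo) -is:retweet lang:en'
```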
In order to use Twitter's API, I had to sign in with my unique authorization tokens. We can begin by defining these, as well as importing a few modules we'll need to get these tweets.
import requests
import json
import twitter
bearer_token = "AAAAAAAAAAAAAAAAAAAAAP65VwEAAAAA8Yue0XLWi9ncdBWwvK4Gb6DyuSc%3D6Z4bsZvdf2f9uLcTHX7PfxiHr4k0SxOytKvwnSvHBRXbFx9Sy3"
twitter_consumer_key = 'udAq5FCrBSbCnpzoD5xgkzM4r'
twitter_consumer_secret = 'J3HqWGFU2H5jfWtvmJQmXTSdBkKlb0iDXbc7kAp064oKl4EJ3B'
twitter_access_token = '1461340400984244225-3eya4Sxj51bPRzcXXVfcZy5bXBj9ei'
twitter_access_secret = 'Z4P3WKGSCqGbSYqu6YeQ4Ad5ZvLwWNrsi1yh8Fp6l84oB'
Using these authorizations, we can build our Twitter API URL.
We can use our bearer_token to authorize ourselves in the header of the API URL, and then define which datapoints we want returned in our response.
def auth():
    """authorizes the user for the Twitter API, using the bearer_token"""
    return bearer_token

def create_headers(bearer_token):
    """creates the 'header' aspect of the API request, using the bearer_token"""
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers
def create_url(next_token):
    """creates the URL to request from Twitter. This function specifically
    uses the recent search endpoint, and then filters the results.
    """
    search_url = "https://api.twitter.com/2/tweets/search/recent"
    # filter results on the desired hashtags/keywords, from 12/06/2021 (the past week),
    # and define which extra fields we want returned in the response; note that
    # multi-word phrases must be quoted to match as phrases
    query_params = {'query': '(proana OR #proana OR pro-ana OR "pro ana" OR anorexia OR #anorexia OR anorexic OR #anorexic OR #promia OR bulimia OR #bulimia OR bulimic OR #eatingdisorder OR "eating disorder" OR #edproblems OR ednos OR #ednos OR thinspiration OR #thinspiration OR thinspo OR #thinspo) -is:retweet lang:en',
                    'start_time': '2021-12-06T00:00:00.000Z',
                    'max_results': 100,
                    'expansions': 'author_id,in_reply_to_user_id,geo.place_id',
                    'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
                    'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
                    'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
                    'next_token': next_token}
    return (search_url, query_params)
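For reference, the requests library serializes the `query_params` dict onto the URL the same way the standard library's `urlencode` does. Below is a minimal sketch, using a shortened stand-in query rather than the real parameters, of what the final request URL looks like:

```python
from urllib.parse import urlencode

# a shortened stand-in for the real query_params dict above
params = {"query": "thinspo -is:retweet lang:en", "max_results": 100}
full_url = "https://api.twitter.com/2/tweets/search/recent?" + urlencode(params)
# full_url == 'https://api.twitter.com/2/tweets/search/recent?query=thinspo+-is%3Aretweet+lang%3Aen&max_results=100'
```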
Once we have authorized ourselves, we can actually request the tweet data from the Twitter API, using the requests library and the URL we just built. If we are paginating our results, we will use the next_token to access the next page of tweets.
def connect_to_endpoint(url, headers, params, next_token=None):
    """GETs tweets from Twitter via the given API URL using the requests
    module (and next_token, if paginating), returning the JSON response.
    """
    # set the pagination token (None on the first request)
    params['next_token'] = next_token
    response = requests.request("GET", url, headers=headers, params=params)
    # raise on any non-200 (connection/authorization) error
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()
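The pagination contract is worth spelling out: each response carries a `meta.next_token` until the last page. The generator below is a hypothetical sketch (not part of the scraper) that walks that contract against mocked responses shaped like Twitter's:

```python
def paginate(fetch, max_pages=10):
    """Yield JSON pages from a next_token-style API.
    `fetch` stands in for connect_to_endpoint: it takes a
    next_token (None for the first call) and returns a dict."""
    token, pages = None, 0
    while pages < max_pages:
        page = fetch(token)
        yield page
        pages += 1
        token = page.get("meta", {}).get("next_token")
        if token is None:
            break

# two mocked pages shaped like the recent-search endpoint's responses
mock = [{"data": ["tweet1"], "meta": {"next_token": "abc", "result_count": 1}},
        {"data": ["tweet2"], "meta": {"result_count": 1}}]
pages = list(paginate(lambda tok: mock[0] if tok is None else mock[1]))
# the loop stops after the second page, when next_token disappears
```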
This is the complicated part. Using the JSON data from the connect_to_endpoint URL, we will loop through each tweet, grab the desired datapoints, and append each tweet's new data to a CSV file.
import csv
import dateutil.parser
def append_to_csv(json_response, fileName):
    """loops through a JSON response, writing information
    about each tweet to a specified CSV file.
    """
    # count how many tweets we add
    counter = 0
    # open (or create) the target CSV file in append mode
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)
    # loop through each tweet, grabbing the desired datapoints
    for tweet in json_response['data']:
        # 1. Author ID
        author_id = tweet['author_id']
        # 2. Time created
        created_at = dateutil.parser.parse(tweet['created_at'])
        # 3. Geolocation
        if 'geo' in tweet:
            geo = tweet['geo']['place_id']
        else:
            geo = " "
        # 4. Tweet ID
        tweet_id = tweet['id']
        # 5. Tweet metrics
        retweet_count = tweet['public_metrics']['retweet_count']
        reply_count = tweet['public_metrics']['reply_count']
        like_count = tweet['public_metrics']['like_count']
        quote_count = tweet['public_metrics']['quote_count']
        # 6. in reply to ___
        if 'in_reply_to_user_id' in tweet:
            in_reply_to_user_id = tweet['in_reply_to_user_id']
        else:
            in_reply_to_user_id = " "
        # 7. source
        source = tweet['source']
        # 8. Tweet text
        text = tweet['text']
        # 9. hashtags used (nested under the 'entities' object)
        if 'entities' in tweet and 'hashtags' in tweet['entities']:
            hashtags = tweet['entities']['hashtags']
        else:
            hashtags = []
        # assemble all data in a list
        res = [author_id, created_at, geo, tweet_id, like_count,
               quote_count, reply_count, retweet_count, in_reply_to_user_id,
               source, text, hashtags]
        # append the result to the CSV file
        csvWriter.writerow(res)
        counter += 1
    # when done, close the CSV file
    csvFile.close()
Using all the helper functions defined previously, we'll actually gather our tweets now:
I originally printed out endpoints to see whether each request connected or not and the total number of tweets, but I have commented out those print statements to save space in the notebook.
import time
import csv
#Inputs for tweets
bearer_token = auth()
headers = create_headers(bearer_token)
#Total number of tweets we collected from the loop
total_tweets = 0
# Create file
csvFile = open("tweetData.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
#Create headers for the data we want to save
csvWriter.writerow(['author id', 'created_at', 'geo', 'id',
                    'like_count', 'quote_count', 'reply_count', 'retweet_count',
                    'in_reply_to_user_id', 'source', 'tweet', 'hashtags'])
csvFile.close()
# Inputs
max_count = 100 # Max tweets per time period
flag = True # Whether we have more tweets to scrape
next_token = None # Token to access next pages of tweets to scrape
# keep requesting pages while the flag is true
while flag:
    # get the response from Twitter's API
    url = create_url(next_token)
    json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
    result_count = json_response['meta']['result_count']
    # if we have more than one page of results
    if 'next_token' in json_response['meta']:
        # save the token to use for the next call
        next_token = json_response['meta']['next_token']
        if result_count is not None and result_count > 0 and next_token is not None:
            append_to_csv(json_response, "tweetData.csv")
            total_tweets += result_count
            time.sleep(5)
    # if no next token exists, this is the final page
    else:
        if result_count is not None and result_count > 0:
            append_to_csv(json_response, "tweetData.csv")
            total_tweets += result_count
            time.sleep(5)
        # since this is the final request, set flag to False to exit the loop
        flag = False
    time.sleep(5)
# print("Total number of results: ", total_tweets)
We'll read the results from the CSV file, so we don't need to scrape again in the future, and store them in a dataframe for easy analysis.
import pandas as pd
df = pd.read_csv("tweetData.csv")
df
| | author id | created_at | geo | id | like_count | quote_count | reply_count | retweet_count | in_reply_to_user_id | source | tweet | hashtags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 386529892 | 2021-12-12 14:53:41+00:00 | | 1470044188955840529 | 9 | 0 | 0 | 9 | | Twitter for iPhone | Conscience. Harmony anorexic-alcoholic-childle... | [] |
| 1 | 1470040865242599433 | 2021-12-12 14:52:56+00:00 | | 1470044000442793986 | 0 | 0 | 0 | 0 | | Twitter for Android | #legspo Its a great inspiration for me. #thins... | [] |
| 2 | 1153148565960740865 | 2021-12-12 14:52:56+00:00 | | 1470043999088091141 | 0 | 0 | 0 | 0 | | Twitter for iPhone | want more ed + sh twt moots :D\r\n\r\n- 16\r\n... | [] |
| 3 | 1083253588216999937 | 2021-12-12 14:52:20+00:00 | | 1470043847749218304 | 0 | 0 | 0 | 0 | | Twitter Web App | Collingwood mom opens up on family’s ‘terrifyi... | [] |
| 4 | 1384251024764923912 | 2021-12-12 14:52:06+00:00 | | 1470043791188967427 | 0 | 0 | 0 | 0 | 1432055473960890371 | Twitter for iPhone | @solskaaa like i said thats a fair point, but ... | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12015 | 1440298875084959759 | 2021-12-06 00:04:17+00:00 | | 1467646039196942340 | 1 | 0 | 0 | 0 | 1300123573772849155 | Twitter for iPhone | @killingklara Thinspo era | [] |
| 12016 | 1013185185070886913 | 2021-12-06 00:02:25+00:00 | | 1467645568122073091 | 0 | 0 | 0 | 0 | | Twitter Web App | I came home and my mom made me a Burger but it... | [] |
| 12017 | 1021458549434667009 | 2021-12-06 00:00:40+00:00 | | 1467645128978403329 | 0 | 0 | 0 | 0 | | Twitter for Android | tw: eating disorder\r\n\r\nreminder that not e... | [] |
| 12018 | 16090125 | 2021-12-06 00:00:04+00:00 | | 1467644975345250304 | 0 | 0 | 0 | 0 | | Airtime Pro | Now playing A Study in Vastness by Ana Roxanne... | [] |
| 12019 | 1410866854512545793 | 2021-12-06 00:00:01+00:00 | | 1467644963211124736 | 2 | 0 | 0 | 0 | | Twitter Web App | 061221 thinspo https://t.co/y7eVHO8tZ2 | [] |
12020 rows × 12 columns
And clean it up a little, like displaying full tweet texts without truncation and re-sorting by post date, ascending.
# show full tweet texts instead of truncating them
pd.set_option('display.max_colwidth', 10000)
# sort df by date, ascending (earliest first)
df.sort_values(by='created_at', ignore_index=True, inplace=True)
df.head()
| | author id | created_at | geo | id | like_count | quote_count | reply_count | retweet_count | in_reply_to_user_id | source | tweet | hashtags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1410866854512545793 | 2021-12-06 00:00:01+00:00 | | 1467644963211124736 | 2 | 0 | 0 | 0 | | Twitter Web App | 061221 thinspo https://t.co/y7eVHO8tZ2 | [] |
| 1 | 16090125 | 2021-12-06 00:00:04+00:00 | | 1467644975345250304 | 0 | 0 | 0 | 0 | | Airtime Pro | Now playing A Study in Vastness by Ana Roxanne on https://t.co/OICjuh6R5E | [] |
| 2 | 1021458549434667009 | 2021-12-06 00:00:40+00:00 | | 1467645128978403329 | 0 | 0 | 0 | 0 | | Twitter for Android | tw: eating disorder\r\n\r\nreminder that not everyone with an eating disorder is underweight. | [] |
| 3 | 1013185185070886913 | 2021-12-06 00:02:25+00:00 | | 1467645568122073091 | 0 | 0 | 0 | 0 | | Twitter Web App | I came home and my mom made me a Burger but it has Cheese and I Hate Cheese and she got Mad like "Fine then Don't Eat" and I was like Ok I can do that and now she's Accusing Me of Anorexia.\r\n\r\n????? | [] |
| 4 | 1440298875084959759 | 2021-12-06 00:04:17+00:00 | | 1467646039196942340 | 1 | 0 | 0 | 0 | 1300123573772849155 | Twitter for iPhone | @killingklara Thinspo era | [] |
len(df)
12020
So, we were able to scrape 12,020 tweets to work with from 2021-12-06 to 2021-12-12.
First, we'll import the modules we'll need.
from textblob import TextBlob
import matplotlib.pyplot as plt
import numpy as np
import nltk
import pycountry
import re
import string
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from langdetect import detect
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
percentage is a helper function to calculate percentages.
def percentage(part, whole):
    """returns part as a percentage of whole"""
    return 100 * float(part) / float(whole)
We'll first convert our df to a dictionary to perform the analysis, check how many tweets we have, and the content of one of these tweets.
tweets = df.to_dict('index')
print(len(tweets))
tweets[3]['tweet']
12020
'I came home and my mom made me a Burger but it has Cheese and I Hate Cheese and she got Mad like "Fine then Don\'t Eat" and I was like Ok I can do that and now she\'s Accusing Me of Anorexia.\r\n\r\n?????'
We can analyse the content of each tweet, as demonstrated above, using sentiment analysis to determine whether it is positive, negative, or neutral around the topic of EDs. We'll score each tweet, then add it to the appropriate list depending on its sentiment score.
# initialize our counters and lists
positive = 0
negative = 0
neutral = 0
polarity = 0
tweet_list = []
neutral_list = []
negative_list = []
positive_list = []
# initialize the analyzer once, then loop through the tweets
sia = SentimentIntensityAnalyzer()
for i in range(len(tweets)):
    tweet_list.append(tweets[i]['tweet'])
    analysis = TextBlob(tweets[i]['tweet'])
    score = sia.polarity_scores(tweets[i]['tweet'])
    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    comp = score['compound']
    polarity += analysis.sentiment.polarity
    if neg > pos:
        negative_list.append(tweets[i]['tweet'])
        negative += 1
    elif pos > neg:
        positive_list.append(tweets[i]['tweet'])
        positive += 1
    else:
        neutral_list.append(tweets[i]['tweet'])
        neutral += 1

# convert the counts to percentages
polarity = percentage(polarity, len(tweets))
positive = format(percentage(positive, len(tweets)), '.1f')
negative = format(percentage(negative, len(tweets)), '.1f')
neutral = format(percentage(neutral, len(tweets)), '.1f')
Our total tweets of each category:
print("total number: ",len(tweet_list))
print("positive number: ",len(positive_list))
print("negative number: ", len(negative_list))
print("neutral number: ",len(neutral_list))
total number:  12020
positive number:  4033
negative number:  5444
neutral number:  2543
And a piechart, as a visual:
# creating the pie chart
labels = ['Positive ['+str(positive)+'%]', 'Neutral ['+str(neutral)+'%]', 'Negative ['+str(negative)+'%]']
# positive/neutral/negative were formatted as strings above, so convert back to floats for plotting
sizes = [float(positive), float(neutral), float(negative)]
colors = ['olivedrab', 'cornflowerblue', 'indianred']
patches, texts = plt.pie(sizes,colors=colors, startangle=90)
plt.style.use('default')
plt.legend(labels)
plt.title("Sentiment Analysis Result for ED Hashtags on Twitter")
plt.axis('equal')
plt.show()
Let's print a few examples of each, to see what the classifier was focusing on:
print("Positive Examples: ")
print(positive_list[62:67])
Positive Examples: ['tw: grosspo, abdl\r\n\r\ntags for reach\r\n#grosspo #fatspo #proana #promia #shtwt #edtwt https://t.co/9dB5TNsG7a', '@pausedME Antidepressants as I was suffering from extreme nausea due to anxiety. Multiple sicknesses in the family kicked it off. Also finding out I was vitamin D and B12 deficient and anaemic! All most drs wanted to do was label me anorexic. I’m now at a healthy weight.', '@shanoawarrior @OofT74480205 @me1stVegan2nd @cerebralsymphoy @thatwitchyjess7 @Khiva1 @Haoshoku @Unbornanon @TheAltruist10 @andyswarbs @Lynnia00721169 @RoodepoortL @Kizwiz6 @PeachVonT @Son_of_Space @OtherCosmonauta @JoeKerr57254356 @pro13A @bloodflowerburn @hargrump @_Aloominati_ @medicalinguist That is against the Disability act making fun of Mental Eating Disorders....You are breaking the laws of Twitter and the ADA....Sounds like you need to rethink your comment...I am the one with the Mental Eating Disorder...Wrong move', '@lilmissmister0 @dietgirrI Why did you look up a thinspo thread? Are you okay?', 'Sept vs now :((( I wish it was September again #edtwt #thinspo #bodycheck https://t.co/tUGJZfiMfS']
Hm. Positive tweets still don't seem that positive - "fatspo" is a phrase usually accompanied by an image of a "fat" person, used as motivation to continue fasting or stay thin. The second tweet, however, mentions that the user has recovered and is now at a healthy weight. So, there's a strange mix.
Let's look at the negative examples:
print("Negative Examples: ")
print(negative_list[62:67])
Negative Examples: ['Snackwells were good. Idk what everyone else remembers. The stupid anorexia panic of the 1990s destroyed so many people’s brains. Started by Naomi god damn Wolff.', "@whoopsdinosaur Don't feel bad. I was a total bitch because I was super hungry at the height of my eating disorder too. It wasn't fair to anyone around me.", 'the way literally no one cares that i think i’ve developed some kind of eating disorder and it’s literally killing me physically and mentally', "Here's latest exclu on Instagram's eating disorder problem from @theo_wayt @nypostbiz \r\nhttps://t.co/0KnaTOYelD via @nypost", 'I need to mute “intermittent fasting” and “eating disorder” because usually fine accounts have determined any caloric reduction is pathological. Any attempt to lose weight is not only disordered, but immoral. I can’t with this shit.']
These tweets are certainly more negative than the positive tweets: swearing, cries for help, etc.
And the neutral:
print("Neutral Examples: ")
print(neutral_list[62:67])
Neutral Examples: ['realized i hav anorexia nervosa☹️', 'anorexia starts tomorrow fr mark my words', 'ednos bordering on anorexia https://t.co/VrRZZlDLOh', '꒰🧠 amity blight from the owl house is an anorexic! [hc] https://t.co/qCKxUZWFdA', 'Is it just me or like I watch this tlc show called “my 3000-lb family” \r\nJust to remind myself that I don’t wanna look like that. Then I work extra hard to be skinny💕 \r\n\r\nPlease no hate this is just how I maintain control 💕\r\n\r\n#proana\r\n#thinsp0 \r\n#skinny\r\n#ed \r\n#staythin https://t.co/T61Mc4IUZe']
These are difficult to code: the first tweet, for example, shows the user is upset about their AN diagnosis, but this is only apparent from the emoji. If the emoji weren't there, the tweet could easily be interpreted as a simple statement, or even as a positive thing in the user's eyes if they are desperately trying to lose weight.
Now that we've explored our tweet examples a bit further, let's approach the sentiment analysis from a new angle using the polarity and subjectivity categories.
We'll start by cleaning our tweets and building a new dataframe without emojis, retweet markers, usernames, or stray \r\n and \t characters, so our analysis can be more accurate. First we define a mapping of strings to replace: the cleaning process replaces each contraction's apostrophe with a space, so we'll need to restore those.
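To see why this mapping is needed, here is the effect of the punctuation-stripping step (using the same character class as the cleaning lambda defined further down) on a contraction:

```python
import re

# stripping non-alphanumeric characters replaces the apostrophe with a space
cleaned = " ".join(re.sub(r"[^0-9A-Za-z \t]", " ", "don't worry").split())
# cleaned == "don t worry" — which the contractions map then repairs to "don't worry"
```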
contractions = {'aren t ': "aren't ", 'can t ': "can't ", 'can t ve ': "can't've ",
'could ve ': "could've ", 'couldn t ': "couldn't ", 'couldn t ve ': "couldn't've ",
'didn t ': "didn't ", 'doesn t ': "doesn't ", 'don t ': "don't ", 'hadn t ': "hadn't ",
'hadn t ve ': "hadn't've ", 'hasn t ': "hasn't ", 'haven t ': "haven't ",
'he d ': "he'd ", 'he d ve ': "he'd've ", 'he ll ': "he'll ", 'he ll ve ': "he'll've ",
'he s ': "he's ", 'how d ': "how'd ", 'how d y ': "how'd'y ", 'how ll ': "how'll ",
'how s ': "how's ", 'i d ': "i'd ", 'i d ve ': "i'd've ", 'i ll ': "i'll ", 'i ll ve ': "i'll've ",
'i m ': "i'm ", 'i ve ': "i've ", 'isn t ': "isn't ", 'it d ': "it'd ", 'it d ve ': "it'd've ",
'it ll ': "it'll ", 'it ll ve ': "it'll've ", 'it s ': "it's ", 'let s ': "let's ", 'ma am ': "ma'am ",
'mayn t ': "mayn't ", 'might ve ': "might've ", 'mightn t ': "mightn't ", 'mightn t ve ': "mightn't've ",
'must ve ': "must've ", 'mustn t ': "mustn't ", 'mustn t ve ': "mustn't've ", 'needn t ': "needn't ",
'needn t ve ': "needn't've ", 'o clock ': "o'clock ", 'oughtn t ': "oughtn't ", 'oughtn t ve ': "oughtn't've ",
'shan t ': "shan't ", 'sha n t ': "sha'n't ", 'shan t ve ': "shan't've ", 'she d ': "she'd ",
'she d ve ': "she'd've", "she ll ": "she'll ", 'she ll ve ': "she'll've ", 'she s ': "she's ",
'should ve ': "should've ", 'shouldn t ': "shouldn't ", 'shouldn t ve ': "shouldn't've ",
'so ve ': "so've ", 'so s ': "so's ", 'that d ': "that'd ", 'that d ve ': "that'd've ",
'that s ': "that's ", 'there d ': "there'd ", 'there d ve ': "there'd've ", 'there s ': "there's ",
'they d ': "they'd ", 'they d ve ': "they'd've ", 'they ll ': "they'll ", 'they ll ve ': "they'll've ",
'they re ': "they're ", 'they ve ': "they've ", 'to ve ': "to've ", 'wasn t ': "wasn't ",
'we d ': "we'd ", 'we d ve ': "we'd've ", 'we ll ': "we'll ", 'we ll ve ': "we'll've ",
'we re ': "we're ", 'we ve ': "we've ", 'weren t ': "weren't ", 'what ll ': "what'll ",
'what ll ve ': "what'll've ", 'what re ': "what're ", 'what s ': "what's ", 'what ve ': "what've ",
'when s ': "when's ", 'when ve ': "when've ", 'where d ': "where'd ", 'where s ': "where's ",
'where ve ': "where've ", 'who ll ': "who'll ", 'who ll ve ': "who'll've ", 'who s ': "who's ",
'who ve ': "who've ", 'why s ': "why's ", 'why ve ': "why've ", 'will ve ': "will've ",
'won t ': "won't ", 'won t ve ': "won't've ", 'would ve ': "would've ", "wouldn t ": "wouldn't ",
'wouldn t ve ': "wouldn't've ", 'y all ': "y'all ", 'y all d ': "y'all'd ", 'y all d ve ': "y'all'd've ",
'y all re ': "y'all're ", 'y all ve ': "y'all've ", 'you d ': "you'd ", 'you d ve ': "you'd've ",
'you ll ': "you'll ", 'you ll ve ': "you'll've ", 'you re ': "you're ", 'you ve ': "you've ", ' s ': "'s ",
# we also want to remove single, stray apostrophes from beginning/end of words
"' ": "", " '": ""}
Now, we'll first clean the tweet text of RT, usernames, punctuations, and emojis. When we do so, our contractions will be cleaned as well, so we'll fix those next.
#Creating new dataframe and new features
tweet_list = pd.DataFrame(tweet_list)
tw_list = pd.DataFrame(tweet_list)
tw_list['text'] = tw_list[0]
# removing RT, punctuation, etc.
clean = lambda x: ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", x).split())
tw_list['text'] = tw_list.text.map(clean)
tw_list['text'] = tw_list.text.str.lower()
# fix contractions: i ll -> i'll, for example
rep = dict((re.escape(k), v) for k, v in contractions.items())
pattern = re.compile("|".join(rep.keys()))
cont = lambda x: pattern.sub(lambda m: rep[re.escape(m.group(0))], x)
tw_list['text'] = tw_list.text.map(cont)
tw_list.head(10)
| | 0 | text |
|---|---|---|
| 0 | 061221 thinspo https://t.co/y7eVHO8tZ2 | 061221 thinspo |
| 1 | Now playing A Study in Vastness by Ana Roxanne on https://t.co/OICjuh6R5E | now playing a study in vastness by ana roxanne on |
| 2 | tw: eating disorder\r\n\r\nreminder that not everyone with an eating disorder is underweight. | tw eating disorder reminder that not everyone with an eating disorder is underweight |
| 3 | I came home and my mom made me a Burger but it has Cheese and I Hate Cheese and she got Mad like "Fine then Don't Eat" and I was like Ok I can do that and now she's Accusing Me of Anorexia.\r\n\r\n????? | i came home and my mom made me a burger but it has cheese and i hate cheese and she got mad like fine then don't eat and i was like ok i can do that and now she's accusing me of anorexia |
| 4 | @killingklara Thinspo era | thinspo era |
| 5 | @Nias_bones Oohh Eating disorder, but I advise that you don't put trigger warning for the first four. No one will see food, weight, fasting, diet related topics and will go punch a wall 😂😂 | bones oohh eating disorder but i advise that you don't put trigger warning for the first four no one will see food weight fasting diet related topics and will go punch a wall |
| 6 | Boycott Instagram! https://t.co/hasfyGgkUH | boycott instagram |
| 7 | genuinely think abt this all the time i say its ednos bc im not rlly losing weight but also idk what my eating habits would actually look like if i wasnt being forced to eat https://t.co/Ba7PMKW5Uu | genuinely think abt this all the time i say its ednos bc im not rlly losing weight but also idk what my eating habits would actually look like if i wasnt being forced to eat |
| 8 | Now playing Suite pour l'invisible by Ana Roxanne on https://t.co/OICjuh6R5E | now playing suite pour l invisible by ana roxanne on |
| 9 | can yall stop posting that one pic of eugenia as thinspo she was literally on the verge of tears when that pic was taken | can yall stop posting that one pic of eugenia as thinspo she was literally on the verge of tears when that pic was taken |
Excellent! Now our analysis may be a little more accurate. We will next calculate the negative, positive, neutral, and compound values for our analysis, adding it to our df.
#Calculating Negative, Positive, Neutral and Compound values
tw_list[['polarity', 'subjectivity']] = tw_list['text'].apply(lambda Text: pd.Series(TextBlob(Text).sentiment))
sia = SentimentIntensityAnalyzer()
for index, row in tw_list['text'].items():
    score = sia.polarity_scores(row)
    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    comp = score['compound']
    if neg > pos:
        tw_list.loc[index, 'sentiment'] = "negative"
    elif pos > neg:
        tw_list.loc[index, 'sentiment'] = "positive"
    else:
        tw_list.loc[index, 'sentiment'] = "neutral"
    tw_list.loc[index, 'neg'] = neg
    tw_list.loc[index, 'neu'] = neu
    tw_list.loc[index, 'pos'] = pos
    tw_list.loc[index, 'compound'] = comp
tw_list
| | 0 | text | polarity | subjectivity | sentiment | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 061221 thinspo https://t.co/y7eVHO8tZ2 | 061221 thinspo | 0.000000 | 0.000 | neutral | 0.000 | 1.000 | 0.000 | 0.0000 |
| 1 | Now playing A Study in Vastness by Ana Roxanne on https://t.co/OICjuh6R5E | now playing a study in vastness by ana roxanne on | 0.000000 | 0.000 | positive | 0.000 | 0.816 | 0.184 | 0.2023 |
| 2 | tw: eating disorder\r\n\r\nreminder that not everyone with an eating disorder is underweight. | tw eating disorder reminder that not everyone with an eating disorder is underweight | 0.000000 | 0.000 | negative | 0.329 | 0.671 | 0.000 | -0.6597 |
| 3 | I came home and my mom made me a Burger but it has Cheese and I Hate Cheese and she got Mad like "Fine then Don't Eat" and I was like Ok I can do that and now she's Accusing Me of Anorexia.\r\n\r\n????? | i came home and my mom made me a burger but it has cheese and i hate cheese and she got mad like fine then don't eat and i was like ok i can do that and now she's accusing me of anorexia | -0.127083 | 0.725 | positive | 0.212 | 0.575 | 0.213 | -0.2263 |
| 4 | @killingklara Thinspo era | thinspo era | 0.000000 | 0.000 | neutral | 0.000 | 1.000 | 0.000 | 0.0000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12015 | @solskaaa like i said thats a fair point, but theres a huge difference between saying "this is my goal body" and "im terrified of looking like this" but i do agree with you thinspo is very harmful aswell. | like i said thats a fair point but theres a huge difference between saying this is my goal body and im terrified of looking like this but i do agree with you thinspo is very harmful aswell | 0.433333 | 0.700 | positive | 0.121 | 0.595 | 0.283 | 0.6542 |
| 12016 | Collingwood mom opens up on family’s ‘terrifying’ journey through son’s eating disorder https://t.co/Sc8n2AHZQo via @collingwoodtday | collingwood mom opens up on family's terrifying journey through son's eating disorder via | -1.000000 | 1.000 | negative | 0.368 | 0.632 | 0.000 | -0.7506 |
| 12017 | want more ed + sh twt moots :D\r\n\r\n- 16\r\n- ednos\r\n- he / him / cloud\r\n- lgbtq+\r\n- https://t.co/7MRI44ReTt\r\n\r\n↻ / ♡ to be moots !! https://t.co/TI9benwKym | want more ed sh twt moots d 16 ednos he him cloud lgbtq to be moots | 0.500000 | 0.500 | positive | 0.000 | 0.915 | 0.085 | 0.0772 |
| 12018 | #legspo Its a great inspiration for me. #thinspo #edtwt #ana #anorexia #skinny #thighgap https://t.co/oDRXFavgAm | legspo its a great inspiration for me thinspo edtwt ana anorexia skinny thighgap | 0.800000 | 0.750 | positive | 0.000 | 0.571 | 0.429 | 0.8176 |
| 12019 | Conscience. Harmony anorexic-alcoholic-childless beautiful\r\n🖕🏼 \r\n◥ ツ كؤد IخُصمI ツ ◤ IسًتَاﮯلُﮯI"IسًﮯفُﮯI"\r\nCCC_CCC_\r\nWe love a love that was more https://t.co/QDsNSgChUm | conscience harmony anorexic alcoholic childless beautiful i i i i i i ccc ccc we love a love that was more | 0.420000 | 0.640 | positive | 0.000 | 0.400 | 0.600 | 0.9432 |
12020 rows × 9 columns
The sentiment function of textblob returns two properties: polarity and subjectivity.
Polarity is a float in the range [-1, 1], where 1 indicates a fully positive statement and -1 a fully negative one.
Subjectivity is a float in the range [0, 1], where 0 indicates an objective (factual) statement and 1 a personal (opinionated) one.
With this data, we can see how confident the analysis was in its classifications, and we can also see how factual/opinionated the tweets are - Are users mostly talking about personal experiences and their emotions, or news sources/studies?
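For clarity, the bucketing rule used in both sentiment loops can be restated as a small helper (classify_sentiment is a hypothetical name, not used elsewhere in this notebook):

```python
def classify_sentiment(neg, pos):
    """Bucket a VADER score: whichever of neg/pos dominates wins;
    a tie (including 0/0) counts as neutral."""
    if neg > pos:
        return "negative"
    if pos > neg:
        return "positive"
    return "neutral"
```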
#Creating new data frames for all sentiments (positive, negative and neutral)
tw_list_negative = tw_list[tw_list["sentiment"]=="negative"]
tw_list_positive = tw_list[tw_list["sentiment"]=="positive"]
tw_list_neutral = tw_list[tw_list["sentiment"]=="neutral"]
def count_values_in_column(data, feature):
    """counts the total and percentage of each value in a column"""
    total = data.loc[:, feature].value_counts(dropna=False)
    percentage = round(data.loc[:, feature].value_counts(dropna=False, normalize=True) * 100, 2)
    return pd.concat([total, percentage], axis=1, keys=['Total', 'Percentage'])
#Count_values for sentiment
sentiment_values = count_values_in_column(tw_list, "sentiment")
sentiment_values
| | Total | Percentage |
|---|---|---|
| negative | 5601 | 46.60 |
| positive | 4011 | 33.37 |
| neutral | 2408 | 20.03 |
Let's create one more piechart, for a visual.
labels = ['Positive ['+str(sentiment_values.loc['positive', 'Percentage'])+'%]' , 'Neutral ['+str(sentiment_values.loc['neutral', 'Percentage'])+'%]','Negative ['+str(sentiment_values.loc['negative', 'Percentage'])+'%]']
sizes = [sentiment_values.loc['positive', 'Total'], sentiment_values.loc['neutral', 'Total'], sentiment_values.loc['negative', 'Total']]
colors = ['olivedrab', 'cornflowerblue', 'indianred']
patches, texts = plt.pie(sizes,colors=colors, startangle=90)
plt.style.use('default')
plt.legend(labels)
plt.title("Sentiment Analysis Result for ED Hashtags on Twitter")
plt.axis('equal')
plt.show()
Interesting: after cleaning, the percentages changed a little bit.
Let's investigate the polarity and subjectivity values by taking the average of each classification for each sentiment.
def count_polarity_subjectivity(dfs):
    """averages polarity and subjectivity for each sentiment dataframe,
    passed in the order [negative, positive, neutral]"""
    d = {'sentiment': ['negative', 'positive', 'neutral'], 'polarity': [], 'subjectivity': []}
    for df in dfs:
        d['polarity'].append(df.loc[:, "polarity"].mean())
        d['subjectivity'].append(df.loc[:, "subjectivity"].mean())
    return pd.DataFrame(data=d)
#Count_values for sentiment
sentiment_ps = count_polarity_subjectivity([tw_list_negative, tw_list_positive, tw_list_neutral])
sentiment_ps = pd.DataFrame(sentiment_ps)
sentiment_ps
| | sentiment | polarity | subjectivity |
|---|---|---|---|
| 0 | negative | -0.066023 | 0.424933 |
| 1 | positive | 0.192462 | 0.449668 |
| 2 | neutral | 0.019563 | 0.149912 |
import plotly.express as px
fig = px.scatter(sentiment_ps, x="subjectivity", y="polarity",
hover_data=["polarity", "subjectivity"],
color="sentiment")
fig.update_xaxes(
range=[-1, 1], # sets the range of xaxis
constrain="domain", # meanwhile compresses the xaxis by decreasing its "domain"
)
fig.update_yaxes(
range=[-1, 1], # sets the range of yaxis
)
fig.show()
The polarity of all three categories is actually surprisingly close to 0, i.e. a "neutral"-coded tweet. However, as expected, the "neutral" tweets are much less subjective (opinionated) than the negative and positive tweets.
I would assume the close-to-zero averages can be explained by the fact that many of these tweets are simply very different from one another. #edtwt is a large community with a lot of representation; in fact, not everyone on edtwt necessarily wants to lose weight, although that is the stereotype. According to this analysis, the positive tweets are more positive than the negative tweets are negative. This could stem partly from misclassifications, as we saw earlier when we sampled a few tweets from each sentiment category, but also from the fact that because edtwt skews so negative (as suggested by the majority of tweets being classified negative), positive tweeters may go out of their way to spread love on the platform.
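Rather than eyeballing the means, the subjectivity gap between the neutral tweets and the negative/positive ones could be quantified with an effect size. A minimal sketch using Cohen's d with a pooled standard deviation; `cohens_d` is a helper I'm introducing here, and the `tw_list_*` frames referenced in the comment are those created above:

```python
import statistics

def cohens_d(a, b):
    """Cohen's d effect size between two samples (pooled standard deviation)."""
    n1, n2 = len(a), len(b)
    s1, s2 = statistics.stdev(a), statistics.stdev(b)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

# With the data frames above, something like:
# cohens_d(tw_list_neutral["subjectivity"].tolist(),
#          tw_list_negative["subjectivity"].tolist())
```

A conventional rule of thumb is that |d| above 0.8 counts as a large effect, so this would tell us whether the neutral/negative subjectivity gap is as substantial as it looks.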
Let's next investigate which words are used the most in each sentiment with WordClouds.
#Function to create a wordcloud
#(assumes the earlier imports: from wordcloud import WordCloud, STOPWORDS / from PIL import Image)
def create_wordcloud(text):
    #mask = np.array(Image.open("cloud.png"))
    wc = WordCloud(background_color="white",
                   #mask=mask,
                   max_words=3000,
                   stopwords=set(STOPWORDS),
                   colormap='ocean',
                   repeat=True)
    wc.generate(str(text))
    wc.to_file("wc.png")
    display(Image.open("wc.png"))
#Concatenating the text of all tweets for the wordcloud
#(joined with spaces so words don't fuse at tweet boundaries)
all_twt = " ".join(str(v) for v in tw_list['text'].values)
#Creating wordcloud for all tweets
create_wordcloud(all_twt)
This is pretty much expected. The majority of tweets mention eating disorders, anorexia (perhaps the most well-known ED), and "thinspo," or "thin inspiration." Thinspo is typically the opposite of the earlier-described "fatspo" posts: a photo of an exceptionally thin person is posted along with a caption, usually either simply "thinspo" or something along the lines of "I wish I looked like this." From my time scrolling through edtwt timelines, the posts are typically personal stories ("starting a fast today!", "i binged today, i feel like a failure", etc.) or thinspo.
Let's take a look at the positive tweets' composition.
#Concatenating the text of positive tweets for the wordcloud
pos_twt = " ".join(str(v) for v in tw_list_positive["text"].values)
#Creating wordcloud for positive sentiment
create_wordcloud(pos_twt)
Again, "eating disorder" and "anorexic" are used often, but the words are certainly different overall. "Love," "help," "eat", "friend," and "good," for example, promote a pro-recovery environment in my mind.
What about the negative tweets?
#Concatenating the text of negative tweets for the wordcloud
neg_twt = " ".join(str(v) for v in tw_list_negative["text"].values)
#Creating wordcloud for negative sentiment
create_wordcloud(neg_twt)
A little different from the positive tweets. "Body," "weight," "fat," "hate," and multiple swears show the side of edtwt that is actively struggling.
It's interesting that the words differ like this, though they certainly don't differ dramatically. The phrase "eating disorder" is prevalent in each of the wordclouds regardless of sentiment, and there's always a large emphasis specifically on anorexia. All in all, whether a tweet is positive or not, edtwt is, as expected, focused entirely on weight, appearance, and eating.
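One way to put a number on how dominant "eating disorder" is across sentiments, rather than judging from the wordclouds alone, is the share of tweets in each sentiment group containing the phrase. A hedged sketch; `phrase_share` is a hypothetical helper, and `tw_list` with its `text`/`sentiment` columns is the frame built above:

```python
import pandas as pd

def phrase_share(df, phrase, text_col="text"):
    """Fraction of tweets in each sentiment group whose text contains `phrase`."""
    hits = df[text_col].str.contains(phrase, case=False, na=False, regex=False)
    # Group the boolean hits by sentiment; the mean of booleans is the share
    return hits.groupby(df["sentiment"]).mean()

# e.g. phrase_share(tw_list, "eating disorder")
```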
Let's also quickly compare the tweets' length and word count for each sentiment, to see if they vary at all.
tw_list['text_len'] = tw_list['text'].astype(str).apply(len)
tw_list['text_word_count'] = tw_list['text'].apply(lambda x: len(str(x).split()))
round(pd.DataFrame(tw_list.groupby("sentiment").text_len.mean()),2)
| sentiment | text_len |
|---|---|
| negative | 137.51 |
| neutral | 55.97 |
| positive | 132.06 |
round(pd.DataFrame(tw_list.groupby("sentiment").text_word_count.mean()),2)
| sentiment | text_word_count |
|---|---|
| negative | 25.38 |
| neutral | 10.15 |
| positive | 24.61 |
tw_samples = pd.concat([tw_list_negative[:3], tw_list_positive[:3], tw_list_neutral[:3]])
tw_samples
| | 0 | text | polarity | subjectivity | sentiment | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|
| 2 | tw: eating disorder\r\n\r\nreminder that not everyone with an eating disorder is underweight. | tw eating disorder reminder that not everyone with an eating disorder is underweight | 0.000000 | 0.000000 | negative | 0.329 | 0.671 | 0.000 | -0.6597 |
| 5 | @Nias_bones Oohh Eating disorder, but I advise that you don't put trigger warning for the first four. No one will see food, weight, fasting, diet related topics and will go punch a wall 😂😂 | bones oohh eating disorder but i advise that you don't put trigger warning for the first four no one will see food weight fasting diet related topics and will go punch a wall | 0.125000 | 0.366667 | negative | 0.132 | 0.795 | 0.073 | -0.2723 |
| 6 | Boycott Instagram! https://t.co/hasfyGgkUH | boycott instagram | 0.000000 | 0.000000 | negative | 0.697 | 0.303 | 0.000 | -0.3182 |
| 1 | Now playing A Study in Vastness by Ana Roxanne on https://t.co/OICjuh6R5E | now playing a study in vastness by ana roxanne on | 0.000000 | 0.000000 | positive | 0.000 | 0.816 | 0.184 | 0.2023 |
| 3 | I came home and my mom made me a Burger but it has Cheese and I Hate Cheese and she got Mad like "Fine then Don't Eat" and I was like Ok I can do that and now she's Accusing Me of Anorexia.\r\n\r\n????? | i came home and my mom made me a burger but it has cheese and i hate cheese and she got mad like fine then don't eat and i was like ok i can do that and now she's accusing me of anorexia | -0.127083 | 0.725000 | positive | 0.212 | 0.575 | 0.213 | -0.2263 |
| 7 | genuinely think abt this all the time i say its ednos bc im not rlly losing weight but also idk what my eating habits would actually look like if i wasnt being forced to eat https://t.co/Ba7PMKW5Uu | genuinely think abt this all the time i say its ednos bc im not rlly losing weight but also idk what my eating habits would actually look like if i wasnt being forced to eat | 0.033333 | 0.266667 | positive | 0.041 | 0.750 | 0.209 | 0.7552 |
| 0 | 061221 thinspo https://t.co/y7eVHO8tZ2 | 061221 thinspo | 0.000000 | 0.000000 | neutral | 0.000 | 1.000 | 0.000 | 0.0000 |
| 4 | @killingklara Thinspo era | thinspo era | 0.000000 | 0.000000 | neutral | 0.000 | 1.000 | 0.000 | 0.0000 |
| 11 | we anorexic 😝 https://t.co/tWO3XldV6N | we anorexic | 0.000000 | 0.000000 | neutral | 0.000 | 1.000 | 0.000 | 0.0000 |
Yes, the negative and positive tweets differ from the neutral tweets. The neutral tweets, as I expected, are mostly just captions to images, usually thinspo. The negative tweets are actually not very negative, but they use language that gets coded as such: in the joking phrase "no one will see food, weight, fasting, diet related topics and will go punch a wall," signalled as a joke by the laughing emojis, the word "punch" is likely coded as negative, even though the main message of the tweet is actually reassuring another user that a trigger warning isn't needed. The positive tweets, however, are more negative in my opinion: one describes being "forced" to eat, and another is outright misclassified, as it references a singer named "Ana," not the shorthand for "anorexia."
All in all, the neutral tweets are captions, so they are shorter. The positive and negative tweets are replies to threads or personal stories, so they are not only more subjective but also longer. Interesting!
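Since the samples above show the two analyzers can disagree (TextBlob's polarity vs. the VADER-derived sentiment label), one quick sanity check is the fraction of tweets where the polarity sign contradicts the label. This is a sketch I'm adding, not part of the original pipeline; it assumes `tw_list` has the `polarity` and `sentiment` columns built earlier:

```python
import pandas as pd

def disagreement_rate(df):
    """Share of tweets where the TextBlob polarity sign contradicts the sentiment label."""
    clash = ((df["polarity"] > 0) & (df["sentiment"] == "negative")) | \
            ((df["polarity"] < 0) & (df["sentiment"] == "positive"))
    return clash.mean()

# e.g. disagreement_rate(tw_list)
```

A high rate would suggest many of the "positive"/"negative" labels should be taken with a grain of salt.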
For the last part of the analysis, I wanted to continue investigating which words are used most often, as I did with the WordClouds. For this part, I will use tokenization.
#Applying tokenization
def tokenization(text):
    return re.split(r'\W+', text)

tw_list['tokenized'] = tw_list['text'].apply(lambda x: tokenization(x.lower()))
#Removing stopwords
stopword = nltk.corpus.stopwords.words('english')

def remove_stopwords(text):
    return [word for word in text if word not in stopword]

tw_list['nonstop'] = tw_list['tokenized'].apply(lambda x: remove_stopwords(x))
#Applying stemmer
ps = nltk.PorterStemmer()

def stemming(text):
    return [ps.stem(word) for word in text]

tw_list['stemmed'] = tw_list['nonstop'].apply(lambda x: stemming(x))
#Cleaning text (used below as the CountVectorizer analyzer)
def clean_text(text):
    text_lc = "".join([word.lower() for word in text if word not in string.punctuation]) # remove punctuation
    text_rc = re.sub('[0-9]+', '', text_lc) # remove numbers
    tokens = re.split(r'\W+', text_rc) # tokenization
    return [ps.stem(word) for word in tokens if word not in stopword] # remove stopwords and stem
tw_list.head()
| | 0 | text | polarity | subjectivity | sentiment | neg | neu | pos | compound | text_len | text_word_count | tokenized | nonstop | stemmed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 061221 thinspo https://t.co/y7eVHO8tZ2 | 061221 thinspo | 0.000000 | 0.000 | neutral | 0.000 | 1.000 | 0.000 | 0.0000 | 14 | 2 | [061221, thinspo] | [061221, thinspo] | [061221, thinspo] |
| 1 | Now playing A Study in Vastness by Ana Roxanne on https://t.co/OICjuh6R5E | now playing a study in vastness by ana roxanne on | 0.000000 | 0.000 | positive | 0.000 | 0.816 | 0.184 | 0.2023 | 49 | 10 | [now, playing, a, study, in, vastness, by, ana, roxanne, on] | [playing, study, vastness, ana, roxanne] | [play, studi, vast, ana, roxann] |
| 2 | tw: eating disorder\r\n\r\nreminder that not everyone with an eating disorder is underweight. | tw eating disorder reminder that not everyone with an eating disorder is underweight | 0.000000 | 0.000 | negative | 0.329 | 0.671 | 0.000 | -0.6597 | 84 | 13 | [tw, eating, disorder, reminder, that, not, everyone, with, an, eating, disorder, is, underweight] | [tw, eating, disorder, reminder, everyone, eating, disorder, underweight] | [tw, eat, disord, remind, everyon, eat, disord, underweight] |
| 3 | I came home and my mom made me a Burger but it has Cheese and I Hate Cheese and she got Mad like "Fine then Don't Eat" and I was like Ok I can do that and now she's Accusing Me of Anorexia.\r\n\r\n????? | i came home and my mom made me a burger but it has cheese and i hate cheese and she got mad like fine then don't eat and i was like ok i can do that and now she's accusing me of anorexia | -0.127083 | 0.725 | positive | 0.212 | 0.575 | 0.213 | -0.2263 | 186 | 43 | [i, came, home, and, my, mom, made, me, a, burger, but, it, has, cheese, and, i, hate, cheese, and, she, got, mad, like, fine, then, don, t, eat, and, i, was, like, ok, i, can, do, that, and, now, she, s, accusing, me, of, anorexia] | [came, home, mom, made, burger, cheese, hate, cheese, got, mad, like, fine, eat, like, ok, accusing, anorexia] | [came, home, mom, made, burger, chees, hate, chees, got, mad, like, fine, eat, like, ok, accus, anorexia] |
| 4 | @killingklara Thinspo era | thinspo era | 0.000000 | 0.000 | neutral | 0.000 | 1.000 | 0.000 | 0.0000 | 11 | 2 | [thinspo, era] | [thinspo, era] | [thinspo, era] |
To count the words, we can use a CountVectorizer, similar to the classification from Assignment 7.
#Applying CountVectorizer
countVectorizer = CountVectorizer(analyzer=clean_text)
countVector = countVectorizer.fit_transform(tw_list['text'])
print('{} tweets contain {} unique word stems'.format(countVector.shape[0], countVector.shape[1]))
#print(countVectorizer.get_feature_names())
# (get_feature_names() was renamed get_feature_names_out() in newer scikit-learn)
count_vect_df = pd.DataFrame(countVector.toarray(), columns=countVectorizer.get_feature_names())
count_vect_df.head()
12020 tweets contain 11220 unique word stems
| | aaa | aaaa | aaaaaa | aaaaaaaaaaaa | aaaahahah | aaah | aarrfghh | ab | abandon | ... | zolpidem | zombi | zone | zoom | zoophilia | zucchini | zuckerberg | zyprexa | zzz | zzzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 11220 columns
Let's first find the most used single words.
# Most used word stems
count = pd.DataFrame(count_vect_df.sum())
countdf = count.sort_values(0, ascending=False).head(20)
singles = pd.DataFrame(countdf[1:11]).reset_index() # skip the single most frequent stem, keep the next ten
singles.rename(columns={'index':'wordstem', 0:'value'}, inplace=True)
singles
| | wordstem | value |
|---|---|---|
| 0 | disord | 4562 |
| 1 | thinspo | 3030 |
| 2 | anorexia | 2217 |
| 3 | like | 2017 |
| 4 | anorex | 1945 |
| 5 | im | 1826 |
| 6 | edtwt | 1073 |
| 7 | dont | 949 |
| 8 | get | 852 |
| 9 | peopl | 852 |
fig = px.bar(singles, x="wordstem", y="value")
fig.show() # to show the plot
Users discuss their disorders and personal experiences the most, as we've seen earlier. What about longer phrases? Let's find the most common two- and three-word phrases using n-grams.
#Function to get the top n-grams
def get_top_n_gram(corpus, ngram_range, n=None):
    vec = CountVectorizer(ngram_range=ngram_range, stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
#n2_bigram
n2_bigrams = get_top_n_gram(tw_list['text'],(2,2),20)
doubles = pd.DataFrame(n2_bigrams)
doubles.rename(columns={0:'phrase', 1:'value'}, inplace=True)
doubles
| | phrase | value |
|---|---|---|
| 0 | eating disorder | 3889 |
| 1 | pro ana | 382 |
| 2 | eating disorders | 253 |
| 3 | edtwt thinspo | 194 |
| 4 | thinspo thread | 180 |
| 5 | mental health | 176 |
| 6 | feel like | 158 |
| 7 | best anorexic | 150 |
| 8 | binge eating | 149 |
| 9 | anorexia nervosa | 128 |
| 10 | having eating | 108 |
| 11 | look like | 107 |
| 12 | weight loss | 100 |
| 13 | tw eating | 99 |
| 14 | anorexia bulimia | 97 |
| 15 | lose weight | 87 |
| 16 | thinspo edtwt | 86 |
| 17 | gt gt | 81 |
| 18 | don know | 80 |
| 19 | self harm | 77 |
fig = px.bar(doubles, x="phrase", y="value")
fig.show() # to show the plot
Just like in the wordclouds, "eating disorder" is the most common. However, "pro ana" is also extremely high on the list, showing that the community is not very pro-recovery. (The "gt gt" entry is an artifact: HTML-escaped ">" characters from quote-style tweets.) Moreover, there are plenty of mentions of "feeling like" or "looking like" someone, alongside "thinspo thread"s, so it's an environment that highly encourages competition and comparison. This is extremely dangerous for young, impressionable teenagers, who are also the group most at risk of developing eating disorders.
As for the 3-word phrases:
#n3_trigram
n3_trigrams = get_top_n_gram(tw_list['text'],(3,3),20)
triples = pd.DataFrame(n3_trigrams)
triples.rename(columns={0:'phrase', 1:'value'}, inplace=True)
triples
| | phrase | value |
|---|---|---|
| 0 | binge eating disorder | 126 |
| 1 | having eating disorder | 107 |
| 2 | tw eating disorder | 91 |
| 3 | eating disorder recovery | 74 |
| 4 | eating disorder just | 67 |
| 5 | like eating disorder | 55 |
| 6 | eating disorder like | 51 |
| 7 | eating disorder treatment | 50 |
| 8 | eating disorder twitter | 46 |
| 9 | gt gt gt | 44 |
| 10 | don eating disorder | 38 |
| 11 | think eating disorder | 37 |
| 12 | develop eating disorder | 37 |
| 13 | struggling eating disorder | 36 |
| 14 | developed eating disorder | 33 |
| 15 | sexual abuse anorexia | 32 |
| 16 | abuse anorexia domestic | 32 |
| 17 | anorexia domestic violence | 32 |
| 18 | domestic violence mental | 32 |
| 19 | violence mental health | 32 |
fig = px.bar(triples, x="phrase", y="value")
fig.show() # to show the plot
As I mentioned earlier, anorexia and bulimia are not the only eating disorders on edtwt. Binge eating disorder is a disorder in which a person consumes large amounts of food (usually far past the point of feeling full) in a short period of time, known as a "binge." It often stems from an "all or nothing" mentality and goes hand in hand with highly restrictive practices: the person restricts so heavily that they cave to the pressure and physical/mental strain, consuming as much "unallowed" food as possible before they must "reset" the next day.
There are also many mentions of abuse. Let's look into those:
abuse = tw_list[tw_list['text'].str.contains('abuse|violence')]
abuse.head()
| | 0 | text | polarity | subjectivity | sentiment | neg | neu | pos | compound | text_len | text_word_count | tokenized | nonstop | stemmed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 180 | ꒰🧠 eda clawthorne from the owl house suffers from ednos and is a self harmer! (substance abuse, self-burning) [hcs] https://t.co/vDaryUeZX6 | eda clawthorne from the owl house suffers from ednos and is a self harmer substance abuse self burning hcs | -0.60000 | 0.700000 | negative | 0.313 | 0.687 | 0.000 | -0.8074 | 106 | 19 | [eda, clawthorne, from, the, owl, house, suffers, from, ednos, and, is, a, self, harmer, substance, abuse, self, burning, hcs] | [eda, clawthorne, owl, house, suffers, ednos, self, harmer, substance, abuse, self, burning, hcs] | [eda, clawthorn, owl, hous, suffer, edno, self, harmer, substanc, abus, self, burn, hc] |
| 232 | and last year in 2020 I exposed my abuser who is known for having an ED and i got directed with a lot of harassment/cruelty which over time built up that gave me urges to go back to anorexia and all of that and I kept trying to resist that but suddenly everyone | and last year in 2020 i exposed my abuser who is known for having an ed and i got directed with a lot of harassment cruelty which over time built up that gave me urges to go back to anorexia and all of that and i kept trying to resist that but suddenly everyone | 0.00000 | 0.188889 | negative | 0.151 | 0.849 | 0.000 | -0.7311 | 261 | 54 | [and, last, year, in, 2020, i, exposed, my, abuser, who, is, known, for, having, an, ed, and, i, got, directed, with, a, lot, of, harassment, cruelty, which, over, time, built, up, that, gave, me, urges, to, go, back, to, anorexia, and, all, of, that, and, i, kept, trying, to, resist, that, but, suddenly, everyone] | [last, year, 2020, exposed, abuser, known, ed, got, directed, lot, harassment, cruelty, time, built, gave, urges, go, back, anorexia, kept, trying, resist, suddenly, everyone] | [last, year, 2020, expos, abus, known, ed, got, direct, lot, harass, cruelti, time, built, gave, urg, go, back, anorexia, kept, tri, resist, suddenli, everyon] |
| 234 | 6 — \r\ntw // eating disorder, parent abuse, abusive relationship, breakups https://t.co/SjkKbZyreL | 6 tw eating disorder parent abuse abusive relationship breakups | 0.00000 | 0.000000 | negative | 0.689 | 0.311 | 0.000 | -0.9022 | 63 | 9 | [6, tw, eating, disorder, parent, abuse, abusive, relationship, breakups] | [6, tw, eating, disorder, parent, abuse, abusive, relationship, breakups] | [6, tw, eat, disord, parent, abus, abus, relationship, breakup] |
| 579 | @shacks_an_shite @australian When will they all deny obese who refuse to manage their eating disorder. Also cancer patients who abused their skin and went out too much in the sun. And anyone who’s developed any other disease from personal decisions. #privatisehealthcare | an shite when will they'all deny obese who refuse to manage their eating disorder also cancer patients who abused their skin and went out too much in the sun and anyone who's developed any other disease from personal decisions privatisehealthcare | 0.04375 | 0.293750 | negative | 0.300 | 0.700 | 0.000 | -0.9325 | 246 | 40 | [an, shite, when, will, they, all, deny, obese, who, refuse, to, manage, their, eating, disorder, also, cancer, patients, who, abused, their, skin, and, went, out, too, much, in, the, sun, and, anyone, who, s, developed, any, other, disease, from, personal, decisions, privatisehealthcare] | [shite, deny, obese, refuse, manage, eating, disorder, also, cancer, patients, abused, skin, went, much, sun, anyone, developed, disease, personal, decisions, privatisehealthcare] | [shite, deni, obes, refus, manag, eat, disord, also, cancer, patient, abus, skin, went, much, sun, anyon, develop, diseas, person, decis, privatisehealthcar] |
| 739 | Autopsy says Colorado cult leader whose mummified and eyeless corpse was found in a room filled with flashing lights died of alcohol abuse, anorexia, and colloidal silver poisoning...basically "death of natural causes" as far as cults are… https://t.co/nQYW2cRhqW [Fark] | autopsy says colorado cult leader whose mummified and eyeless corpse was found in a room filled with flashing lights died of alcohol abuse anorexia and colloidal silver poisoning basically death of natural causes as far as cults are fark | 0.20000 | 0.766667 | negative | 0.358 | 0.596 | 0.047 | -0.9565 | 237 | 39 | [autopsy, says, colorado, cult, leader, whose, mummified, and, eyeless, corpse, was, found, in, a, room, filled, with, flashing, lights, died, of, alcohol, abuse, anorexia, and, colloidal, silver, poisoning, basically, death, of, natural, causes, as, far, as, cults, are, fark] | [autopsy, says, colorado, cult, leader, whose, mummified, eyeless, corpse, found, room, filled, flashing, lights, died, alcohol, abuse, anorexia, colloidal, silver, poisoning, basically, death, natural, causes, far, cults, fark] | [autopsi, say, colorado, cult, leader, whose, mummifi, eyeless, corps, found, room, fill, flash, light, die, alcohol, abus, anorexia, colloid, silver, poison, basic, death, natur, caus, far, cult, fark] |
Investigating these tweets, they are (in order): a post about a character from a TV series, a personal recovery story, a fanfiction thread, a tweet advocating for better healthcare, and a news story about a Colorado cult leader who died earlier this year. The tweets mentioning abuse are varied in content, in other words, and seem to mostly report on other people (hypothetical, as in the healthcare tweet, or fictional, as in the TV series and fanfiction tweets). It doesn't seem like edtwt is mostly being used to talk about one's own abuse, but rather the abuse of others.
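One caveat on the filter above: the substring pattern `'abuse|violence'` misses related word forms such as "abusive" and "violent" (neither contains the full substring). A stem-based, case-insensitive pattern would catch them; this is a refinement sketch, not the filter actually used in this study:

```python
import pandas as pd

# r"\babus" matches "abuse", "abuser", "abused", "abusive";
# r"\bviolen" matches "violence" and "violent".
ABUSE_PATTERN = r"\babus|\bviolen"

def filter_abuse(df, text_col="text"):
    """Rows whose text mentions any abuse/violence word form."""
    return df[df[text_col].str.contains(ABUSE_PATTERN, case=False, na=False)]
```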
However, "eating disorder recovery" is also high on the list. This is promising, as it suggests a fair amount of users are either promoting or at least starting a dialogue about recovery. However, the community in general has still proven to be extremely toxic and dangerous, especially for young, vulnerable minds.
This data was collected from Twitter over a 6-day period. I chose to research Twitter not because I think it has a particularly awful community, but because of time constraints: it was the easiest API to sign up for. Since a lot of edtwt seems focused on sharing thinspo images or personal stories, I think Instagram and Tumblr would be good platforms to research as well, given that they were both founded on image sharing. Instagram deprecated their API in 2020, and Tumblr's API was very difficult to sign up for, but I think those two would absolutely offer interesting new insights into online ED communities.
Moreover, I only scraped data once to study the nature of the ED community. Eni suggested keeping a Twitter streaming connection open and monitoring which posts were taken down, so I could observe Twitter's response to the ED community over time. Again, due to time constraints, I was unable to figure out how to do that in time, but I think it is the logical next step for this research topic.
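For reference, the streaming approach might look roughly like this with Tweepy's v2 filtered-stream client (Tweepy 4.x). This is only a sketch under stated assumptions: the hashtag list, the `EDStream` class name, and the bearer-token parameter are all illustrative, and `run_stream` is defined but never called here:

```python
def build_ed_rule(hashtags):
    """Combine ED hashtags into one filtered-stream rule string."""
    return " OR ".join("#" + h for h in hashtags)

def run_stream(bearer_token):
    """Open a filtered stream and log tweet ids; re-checking those ids later
    with the lookup endpoint would reveal which tweets were since removed."""
    import tweepy  # Tweepy 4.x (Twitter API v2)

    class EDStream(tweepy.StreamingClient):
        def on_tweet(self, tweet):
            # Log id and timestamp for later re-checking
            print(tweet.id, tweet.created_at)

    stream = EDStream(bearer_token)
    stream.add_rules(tweepy.StreamRule(build_ed_rule(["edtwt", "thinspo", "proana"])))
    stream.filter(tweet_fields=["created_at"])
```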
My final thoughts on the design of my research approach are that 1) some irrelevant data was included, like the tweet about the singer named Ana, and 2) some tweets were misclassified as positive when they should be negative, and vice versa. I expected the first issue, as edtwt users purposefully use difficult-to-scrape hashtags and keywords to avoid detection by people like me; nonetheless, I'm sure the query I used with the API could be refined further to combat this. The second issue is more a limitation of the sentiment analysis libraries I chose, so not much can be done short of using another library, classifying by hand, etc.
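A cheap partial mitigation for the first issue would be a hand-built blocklist of known false-positive phrases, grown through manual review. This is a hypothetical refinement: "ana roxanne" is the one false positive observed above, and `drop_false_positives` is a helper I'm introducing, not part of the original pipeline:

```python
import pandas as pd

# Known false-positive phrases; extend as more misses are found in review
FALSE_POSITIVE_PATTERNS = ["ana roxanne"]

def drop_false_positives(df, text_col="text"):
    """Remove tweets matching any known false-positive phrase."""
    pattern = "|".join(FALSE_POSITIVE_PATTERNS)
    mask = df[text_col].str.contains(pattern, case=False, na=False)
    return df[~mask].copy()
```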
In conclusion, I think #edtwt is a very dangerous, toxic online community. The majority of tweets are negative, often describing a user's self-hatred for how "fat" they are. Even the neutral tweets are steeped in negativity, as they are most often captions on photos of dangerously thin women, urging young users to fast more, eat less, and be as thin as them someday. There are discussions of suicide, self-harm, and starvation, suggestions to go on diets of 350 calories a day, and other terrible threads. I am exceptionally worried that the majority of Twitter users are young, impressionable teenagers who could be easily influenced by harmful messages and communities like #edtwt.
Other studies like mine have been done before, and I think it is more than safe to conclude that hashtags like #proana and tweets containing keywords like "eating disorder" should be monitored. As I stated earlier, I think the next step would be to see how Twitter is monitoring these tweet feeds, and find suggestions on how to improve content deletion.